Estimating Posterior Model Probabilities via Bayesian Model Based Sampling

Bayesian Methods in Statistics and Social Sciences II
University of Amsterdam

Merlise Clyde

Duke University

Outline

  • Canonical Regression Model & Bayesian Model Averaging

  • Estimation via Monte Carlo Frequencies from MCMC

  • Probability Proportional to Size Sampling in Finite Populations

  • Adaptive Independent Metropolis/Adaptive Importance Sampling

Canonical Regression Model

  • Observe response vector \(\mathbf{Y}\) with predictor variables \(X_1 \dots X_p\).

  • Model for data under a specific model \({\cal M}_\gamma\): \[\begin{equation*} \mathbf{Y}\mid \alpha, \boldsymbol{\beta_\gamma}, \phi, {\cal M}_\gamma\sim \textsf{N}(\mathbf{1}_n\alpha + \mathbf{X}_{\boldsymbol{\gamma}}\boldsymbol{\beta_\gamma}, \mathbf{I}_n/\phi) \label{eq:linear.model} \end{equation*}\]

  • Models \({\cal M}_\gamma\) encoded by \(\boldsymbol{\gamma}= (\gamma_1, \ldots \gamma_p)^T\) binary vector with \(\gamma_j = 1\) indicating that \(X_j\) is included in model \({\cal M}_\gamma\) where \[\begin{align*} \gamma_j = 0 & \Leftrightarrow \beta_j = 0 \\ \gamma_j = 1 & \Leftrightarrow \beta_j \neq 0 \end{align*}\]

  • \(\mathbf{X}_{\boldsymbol{\gamma}}\): the \(n \times p_{\boldsymbol{\gamma}}\) design matrix for model \({\cal M}_\gamma\)

  • \(\boldsymbol{\beta_\gamma}\): the \(p_{\boldsymbol{\gamma}}\) vector of non-zero regression coefficients under \({\cal M}_\gamma\)

  • intercept \(\alpha\), precision \(\phi\) common to all models

Bayesian Model Averaging (BMA)

  • place prior distributions on all unknowns \((\boldsymbol{\beta}_{{\cal M}_\gamma}, {\cal M}_\gamma, \alpha_{{\cal M}_\gamma}, \phi_{{\cal M}_\gamma})\) and turn the Bayesian crank to get posterior distributions!

  • for nice priors, we can integrate out the parameters \(\boldsymbol{\theta}_{\boldsymbol{\gamma}}= (\boldsymbol{\beta_\gamma}, \alpha_{{\cal M}_\gamma}, \phi_{{\cal M}_\gamma})\) to obtain the marginal likelihood of \({\cal M}_\gamma\) \[\begin{align*} p(\mathbf{Y}\mid {\cal M}_\gamma) & = \int p(\mathbf{Y}\mid \boldsymbol{\theta}_{\boldsymbol{\gamma}}, {\cal M}_\gamma)p(\boldsymbol{\theta}_{\boldsymbol{\gamma}}\mid {\cal M}_\gamma) d\boldsymbol{\theta}_{\boldsymbol{\gamma}}\\ p({\cal M}_\gamma\mid \mathbf{Y}) & = \frac {p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)} {\sum_{\boldsymbol{\gamma}\in \boldsymbol{\Gamma}} p(\mathbf{Y}\mid {\cal M}_\gamma)p({\cal M}_\gamma)} \end{align*}\]

  • posterior distribution of quantities \(\Delta\) of interest under BMA \[ \sum_{\boldsymbol{\gamma}\in \boldsymbol{\Gamma}} p({\cal M}_\gamma\mid \mathbf{Y}) p(\Delta \mid \mathbf{Y}, {\cal M}_\gamma) \]

  • estimation \(\textsf{E}[\boldsymbol{\mu}\mid \mathbf{Y}]\), \(p(\mathbf{Y}^* \mid \mathbf{Y})\), marginal inclusion probabilities \(P(\gamma_j = 1 \mid \mathbf{Y})\)
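The BMA arithmetic above can be sketched numerically. This is a minimal illustration assuming made-up log marginal likelihoods and a uniform model prior on a toy space of \(p = 3\) predictors; it shows the renormalization over models, marginal inclusion probabilities, and a model-averaged quantity:

```python
# Sketch of BMA arithmetic: posterior model probabilities from (made-up)
# log marginal likelihoods and priors, then marginal inclusion probabilities
# and a model-averaged estimate of a model-specific quantity.
import itertools
import numpy as np

p = 3
models = np.array(list(itertools.product([0, 1], repeat=p)))  # all 2^p gammas
log_marg = -0.5 * np.array([4.0, 3.1, 2.2, 1.0, 2.8, 1.7, 0.9, 0.2])  # toy
prior = np.full(len(models), 1.0 / len(models))               # uniform p(M)

# p(M_gamma | Y) propto p(Y | M_gamma) p(M_gamma), stabilized in log space
w = np.exp(log_marg - log_marg.max()) * prior
post = w / w.sum()

# marginal inclusion probabilities P(gamma_j = 1 | Y)
incl = post @ models
# BMA average of a model-specific quantity Delta_hat (toy values)
delta_hat = np.arange(len(models), dtype=float)
bma = post @ delta_hat
print(post.sum(), incl, bma)
```

Subtracting the max log marginal likelihood before exponentiating avoids underflow, which matters once marginal likelihoods span many orders of magnitude.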

MCMC Sampling from Posterior Distribution

Use a sample of models from \(\boldsymbol{\Gamma}\) to approximate the posterior distribution of models

  • design a Markov Chain to transition through \(\boldsymbol{\Gamma}\) with stationary distribution \(p({\cal M}_\gamma\mid \mathbf{Y})\) \[ p({\cal M}_\gamma\mid \mathbf{Y}) \propto p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma) \]

  • propose a new model from \(q(\boldsymbol{\gamma}^* \mid \boldsymbol{\gamma})\)

  • accept moving to \(\boldsymbol{\gamma}^*\) with probability \[ \textsf{MH} = \min\left(1, \frac{p(\mathbf{Y}\mid {\cal M}_{\gamma^*})\, p({\cal M}_{\gamma^*})/q(\boldsymbol{\gamma}^* \mid \boldsymbol{\gamma})} {p(\mathbf{Y}\mid {\cal M}_\gamma)\, p({\cal M}_\gamma)/q(\boldsymbol{\gamma}\mid \boldsymbol{\gamma}^*)}\right) \]

  • otherwise stay at model \({\cal M}_\gamma\)

  • models are sampled proportional to their posterior probabilities as \(T \to \infty\)
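The scheme above can be sketched on a toy model space small enough to enumerate. This assumes a made-up unnormalized log posterior over \(p = 4\) inclusion indicators and a symmetric single-flip proposal (so \(q\) cancels in the MH ratio); Monte Carlo frequencies are then compared against exact enumeration:

```python
# Sketch of Metropolis-Hastings over binary inclusion vectors gamma,
# with a toy (made-up) unnormalized posterior so the exact answer
# is available by enumerating all 2^p models.
import itertools
import numpy as np

rng = np.random.default_rng(0)
p = 4  # number of candidate predictors

def log_post_unnorm(gamma):
    # stand-in for log p(Y | M_gamma) + log p(M_gamma); purely illustrative
    g = np.asarray(gamma, dtype=float)
    return 2.0 * g[0] + 1.0 * g[1] - 0.5 * g[2] - 1.5 * g.sum()

# exact posterior by enumeration
models = list(itertools.product([0, 1], repeat=p))
logw = np.array([log_post_unnorm(m) for m in models])
exact = np.exp(logw - logw.max())
exact /= exact.sum()

# MH with a symmetric single-flip proposal (q cancels in the ratio)
T = 200_000
gamma = np.zeros(p, dtype=int)
cur = log_post_unnorm(gamma)
counts = {m: 0 for m in models}
for _ in range(T):
    prop = gamma.copy()
    prop[rng.integers(p)] ^= 1          # flip one coordinate
    lp = log_post_unnorm(prop)
    if np.log(rng.uniform()) < lp - cur:
        gamma, cur = prop, lp
    counts[tuple(gamma)] += 1

freq = np.array([counts[m] for m in models]) / T
print(np.abs(freq - exact).max())  # small for large T
```

The Monte Carlo frequencies here are exactly the \(n_{\boldsymbol{\gamma}}/T\) estimates discussed on the next slides.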

Estimation in BMA

Estimate posterior model probabilities via Monte Carlo frequencies or ergodic averages \[\begin{align*} \widehat{p({\cal M}_\gamma\mid \mathbf{Y})} & = \frac{\sum_{t = 1}^T I({{\cal M}}_t = {\cal M}_\gamma)} {T} = \frac{n_{\boldsymbol{\gamma}}\, I({\cal M}_\gamma\in S)} {\sum_{\boldsymbol{\gamma}' \in S} n_{\boldsymbol{\gamma}'}} \end{align*}\]

  • \(T\) = # MCMC samples

  • \(S\) is the collection of unique sampled models

  • \(n_{\boldsymbol{\gamma}}\) is the frequency of model \({\cal M}_\gamma\) in \(S\)

  • \(n = \sum_{\boldsymbol{\gamma}\in S} n_{\boldsymbol{\gamma}}\) is the total number of sampled models, counting repeats (so \(n = T\))

  • asymptotically unbiased as \(T \to \infty\)

Monte Carlo Frequencies

  • fundamentally unsound to a Bayesian! (O’Hagan 1987, The Statistician)

  • ignores observed information in the marginal likelihoods \(\times\) prior probabilities!

  • Can view MH as a form of Probability Proportional to Size Sampling (PPS) With Replacement

  • can we do better using ideas from Finite Population Sampling?

    • Let \(q({{\cal M}}_i)\) be the probability of selecting \({{\cal M}}_i\)

    • Goal is to estimate \(C= \sum_i^N p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)\)

      • Hansen-Hurwitz (HH)
      • Horvitz-Thompson (HT)
      • Hájek
      • Basu/Bayes

Hansen-Hurwitz (HH)

  • Hansen-Hurwitz (1943) may be viewed as an importance sampling estimate \[\hat{C} = \frac{1}{n}\sum_{i \in S} \frac{ n_i\, p(\mathbf{Y}\mid {{\cal M}}_i)\, p({{\cal M}}_i)}{q({{\cal M}}_i)} \]

  • If we have “perfect” samples from the posterior then \(q({{\cal M}}_i) = \frac{p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)}{C}\) and recover \(C\)!

  • Since \(C\) is unknown, apply the ratio HH estimator (or self-normalized IS) \[ \hat{C} = \frac{\frac1 n \sum_{i \in S} \frac{n_i\, p(\mathbf{Y}\mid {{\cal M}}_i)\, p({{\cal M}}_i)}{q({{\cal M}}_i)}}{ \frac1 n \sum_{i \in S} \frac{n_i\, p({{\cal M}}_i)}{q({{\cal M}}_i)}} = \left[ \frac{1}{n} \sum_{i \in S} \frac{n_i}{p(\mathbf{Y}\mid {{\cal M}}_i)} \right]^{-1} \] where the second equality holds when \(q({{\cal M}}_i) \propto p(\mathbf{Y}\mid {{\cal M}}_i)\, p({{\cal M}}_i)\)

But this recovers the “infamous” harmonic mean estimator of Newton & Raftery (1994); while simulation-consistent, it is highly unstable (its variance may be infinite)!
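The collapse to the harmonic mean can be checked numerically. This sketch assumes a made-up discrete model space with a uniform prior, samples exactly from the posterior (the "perfect" \(q\)), and estimates \(C\) as the harmonic mean of the sampled likelihoods; with heavier-tailed likelihoods the same estimator's variance blows up:

```python
# Sketch: when sampling from the posterior, the self-normalized IS / ratio
# HH estimator of C reduces to the harmonic mean of sampled likelihoods,
# via the identity E_post[1 / p(Y|M)] = sum_i p(M_i) / C = 1/C.
# All quantities are made up for illustration.
import numpy as np

rng = np.random.default_rng(1)
N = 1_000
lik = np.exp(rng.normal(0.0, 1.0, N))   # p(Y | M_i), illustrative
prior = np.full(N, 1.0 / N)             # p(M_i)
m = lik * prior
C = m.sum()                              # target normalizing constant

q = m / C                                # "perfect" posterior sampling probs
idx = rng.choice(N, size=100_000, p=q)   # draws from the posterior

C_hm = 1.0 / np.mean(1.0 / lik[idx])     # harmonic mean estimate of C
print(C, C_hm)
```

Here the likelihood spread is mild, so the estimate lands close to \(C\); the instability appears when a few models carry most of the posterior mass but little likelihood.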

Horvitz-Thompson (HT)

  • inclusion probability that \(\boldsymbol{\gamma}_i \in S\): under sampling with replacement, \(\pi_i = 1 - (1 - q({{\cal M}}_i))^T\)

  • HT estimate of normalizing constant: \[\quad \hat{C} = \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)} {\pi_i}\] (dominates HH; the unique hyper-admissible estimate of \(C\))

  • Hájek (1971) estimator uses an auxiliary variable \(A_i > 0\), where we expect \(p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i) \propto A_i\), with \(A \equiv \sum_{i = 1}^{N} A_i\) \[\hat{C} = \frac{\sum_{i=1}^n \frac{p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)} {\pi_i}} {\sum_{i=1}^n\frac{A_i/A}{\pi_i}}\] may be preferable when the \(p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)\) are weakly correlated with \(\pi_i\) or when \(n\) is not fixed
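Both estimators can be sketched under PPS sampling with replacement. This assumes made-up masses \(m_i\) (standing in for \(p(\mathbf{Y}\mid {\cal M}_i)\,p({\cal M}_i)\)) and an auxiliary size variable \(A_i\) roughly proportional to \(m_i\); averaging replicates shows the design-unbiasedness of HT:

```python
# Sketch of Horvitz-Thompson and Hajek estimates of C = sum_i m_i from
# PPS-with-replacement samples. All quantities are made up; pi_i uses the
# with-replacement inclusion probability 1 - (1 - q_i)^T from the slide.
import numpy as np

rng = np.random.default_rng(2)
N, T = 200, 500
m = np.exp(rng.normal(0.0, 1.5, N))     # unnormalized posterior masses
C = m.sum()
A = np.sqrt(m)                           # auxiliary size, roughly prop. to m
q = A / A.sum()                          # per-draw selection probabilities
pi = 1.0 - (1.0 - q) ** T                # inclusion prob after T draws

def one_replicate():
    S = np.unique(rng.choice(N, size=T, p=q))   # distinct sampled models
    ht = np.sum(m[S] / pi[S])                    # HT: no 1/n factor
    hajek = ht / np.sum((A[S] / A.sum()) / pi[S])
    return ht, hajek

reps = np.array([one_replicate() for _ in range(300)])
print(C, reps.mean(axis=0))   # HT is design-unbiased for C
```

The Hájek denominator estimates \(\sum_i A_i/A = 1\), so it trades a little bias for reduced variance when \(n\) is random.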

Basu and Bayes

Basu’s (1971) famous circus example illustrated potential problems with the Horvitz-Thompson estimator (a similar problem arises with IS)

  • violates the likelihood principle

  • once we have samples, \(p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)\) are fixed and the sampling probabilities are not relevant

  • the only remaining randomness concerns the units that were not sampled (which is where the sampling design enters)

  • Basu’s estimate (using \(\pi_i = A_i/A\)), \[\hat{C} = \sum_{i \in S} p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i) + \frac{1}{n} \left( \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)}{\pi_i} \right) \times \left(\sum_{i \notin S} \pi_i \right)\]

  • conditions on the observed data sum and estimates remaining
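Basu's estimate can be sketched directly from the display above. This assumes made-up masses \(m_i\) and a noisy auxiliary \(A_i\) roughly proportional to \(m_i\); note that when every model is sampled, the predictive term vanishes and the estimate is exact:

```python
# Sketch of Basu's (1971) estimate: keep the observed masses as-is and
# predict only the unsampled remainder, with pi_i = A_i / A.
# m_i again stands in for p(Y|M_i) p(M_i); all values are made up.
import numpy as np

rng = np.random.default_rng(3)
N = 500
m = np.exp(rng.normal(0.0, 1.0, N))
C = m.sum()
A = m * np.exp(rng.normal(0.0, 0.2, N))   # noisy auxiliary, approx prop. to m
pi = A / A.sum()

def basu(S):
    n = len(S)
    out = np.setdiff1d(np.arange(N), S)
    ht_mean = np.mean(m[S] / pi[S])       # (1/n) sum_{i in S} m_i / pi_i
    return m[S].sum() + ht_mean * pi[out].sum()

S = np.unique(rng.choice(N, size=1_000, p=pi))   # PPS-WR sample of models
print(C, basu(S), basu(np.arange(N)))  # full-population call returns C exactly
```

The observed masses enter the estimate untouched, which is exactly the conditioning-on-the-data point: only \(\sum_{i \notin S} \pi_i\) is estimated.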

Model Based Methods

Basu’s (1971) estimate of the total can be justified as a “super-population” Model Based approach (Meeden and Ghosh, 1983)

  • Let \(m_i = p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)\) \[\begin{align} m_i \mid \pi_i &\mathrel{\mathop{\sim}\limits^{\rm ind}}N(\pi_i \beta, \sigma^2 \pi_i^2) \\ p(\beta, \sigma^2) & \propto 1/\sigma^2 \end{align}\]

  • the posterior mean of \(\beta\) is \(\hat{\beta} = \frac{1}{n} \sum_{i \in S} \frac{m_i}{\pi_i}\) (the HT estimate of the total, scaled by \(1/n\))

  • using the posterior predictive for \(m_i \notin S\), \(\textsf{E}[m_i \mid m_j \in S] = \pi_i \hat{\beta}\) \[\begin{align*} C & = \sum_{i \in \boldsymbol{\Gamma}} m_i = \sum_{i \in S} m_i + \sum_{i \notin S} m_i \\ \hat{C} & = \sum_{i \in S} m_i + \sum_{i \notin S} \hat{\beta} \pi_i = \sum_{i \in S} m_i + \left[\frac{1}{n} \sum_{i \in S} \frac{m_i}{\pi_i} \right] \sum_{i \notin S} \pi_i \end{align*}\]

Final Posterior Estimates

  • estimate of the posterior probability of \({\cal M}_\gamma\) for \({\cal M}_\gamma\in S\) \[ \frac{p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)}{\sum_{i \in S} p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i) + \frac{1}{n} \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)} \]

  • estimate of the total probability of all models in \(\boldsymbol{\Gamma}- S\) from the predictive distribution \[ \frac{\frac{1}{n} \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)}{\sum_{i \in S} p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i) + \frac{1}{n} \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)} \]

  • Uses renormalized marginal likelihoods of sampled models

  • easy to compute marginal inclusion probabilities

  • Other mean/variance assumptions for the super-population model lead to other estimates for \(C\), \(p({\cal M}_\gamma\mid \mathbf{Y})\), etc

  • What about \(\textsf{E}[\boldsymbol{\beta}\mid \mathbf{Y}]\), \(\textsf{E}[\mathbf{X}\boldsymbol{\beta}\mid \mathbf{Y}]\), \(\textsf{E}[\mathbf{Y}^* \mid \mathbf{Y}]\) or \(p(\Delta \mid \mathbf{Y})\)?

Choice for \(q({\cal M}_\gamma)\) or \(\mathbf{A}_{{\cal M}_\gamma}\)?

  • The joint posterior distribution of \(\boldsymbol{\gamma}\) (dropping \(\mathbf{Y}\)) may be factored: \[p({\cal M}_\gamma\mid \mathbf{Y}) \equiv p(\boldsymbol{\gamma}\mid \mathbf{Y}) = \prod_{j = 1}^p p(\gamma_j \mid \boldsymbol{\gamma}_{<j}) \] where \(\boldsymbol{\gamma}_{< j}\equiv \{\gamma_k\}\) for \(k < j\) and \(p(\gamma_1 \mid \boldsymbol{\gamma}_{<1}) \equiv p(\gamma_1)\).

  • As \(\gamma_j\) are binary, re-express as \[\begin{equation*} p(\boldsymbol{\gamma}\mid \mathbf{Y}) = \prod_{j=1}^p(\rho_{j \mid <j})^{\gamma_j}{(1-\rho_{j \mid <j})}^{1-\gamma_j} \end{equation*}\] where \(\rho_{j \mid <j} \equiv \Pr(\gamma_j = 1 \mid \boldsymbol{\gamma}_{<j})\) and \(\rho_{1 \mid < 1} = \rho_1\), the marginal probability.

  • Product of Dependent Bernoullis

Global Adaptive MCMC Proposal

Factor proposal \[q(\boldsymbol{\gamma}) = \prod_{j = 1}^p q(\gamma_j \mid \boldsymbol{\gamma}_{<j}) = \prod_j \textsf{Ber}(\hat{\rho}_{j \mid <j}) \]

  • Note: \(\Pr(\gamma_j = 1 \mid \boldsymbol{\gamma}_{<j}) = \textsf{E}[\gamma_j \mid \boldsymbol{\gamma}_{<j}]\)

  • Fit a sequence of \(p\) regressions of \(\gamma_j\) on \(\boldsymbol{\gamma}_{<j}\) \[\begin{align*} \gamma_1 & = \mu_1 + \epsilon_1 \\ \gamma_2 & = \mu_2 + \beta_{2 1} (\gamma_1 - \mu_1) + \epsilon_2 \\ \gamma_3 & = \mu_3 + \beta_{3 1} (\gamma_1 - \mu_1) + \beta_{3 2} (\gamma_2 - \mu_2) + \epsilon_3 \\ & \vdots \\ \gamma_p & = \mu_p + \beta_{p 1} (\gamma_1 - \mu_1) + \ldots + \beta_{p \, p-1} (\gamma_{p-1} - \mu_{p-1})+ \epsilon_p \end{align*}\]

Compositional Regression

Approximate model \[\boldsymbol{\gamma}\sim \textsf{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}_{\boldsymbol{\gamma}})\]

  • Wermuth (1980) compositional regression \[ \mathbf{G}= \mathbf{1}_{T} \boldsymbol{\mu}^T + (\mathbf{G}- \mathbf{1}_T \boldsymbol{\mu}^T) \mathbf{B}+ \boldsymbol{\epsilon} \]

  • \(\mathbf{G}\) is \(T \times p\) matrix where row \(t\) is \(\boldsymbol{\gamma}_t\)

  • \(\boldsymbol{\mu}\) is the \(p\) dimensional vector of \(\textsf{E}[\boldsymbol{\gamma}]\)

  • \(\boldsymbol{\Sigma}_{\boldsymbol{\gamma}} = \mathbf{U}^T \mathbf{U}\) where \(\mathbf{U}\) is upper triangular Cholesky decomposition of covariance matrix of \(\boldsymbol{\gamma}\) (\(p \times p\))

  • \(\mathbf{B} = \mathbf{I}_p - \mathbf{U}^{-1}\operatorname{diag}(\mathbf{U})\), i.e. \(\mathbf{B}^T = \mathbf{I}_p - \operatorname{diag}(\mathbf{U})\,\mathbf{U}^{-T}\) (lower triangular)

  • \(\mathbf{B}\) is a \(p \times p\) strictly upper triangular matrix (zeros on the diagonal) whose \(j\)th column holds the regression coefficients for the \(j\)th regression
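The identity can be verified numerically: one Cholesky factorization of \(\boldsymbol{\Sigma}_{\boldsymbol{\gamma}}\) reproduces the coefficients of all \(p\) sequential regressions. This sketch simulates correlated binary inclusion vectors by an illustrative thresholding mechanism (not the talk's sampler) and compares a column of \(\mathbf{B}\) with the corresponding OLS fit:

```python
# Sketch checking the compositional-regression identity: with
# Sigma = U^T U (U upper triangular), B = I - U^{-1} diag(U) collects the
# coefficients of the regressions of gamma_j on gamma_{<j}, column by column.
import numpy as np

rng = np.random.default_rng(4)
p, T = 5, 2_000
# simulate correlated binary inclusion vectors (illustrative mechanism)
z = rng.normal(size=(T, p)) @ np.triu(rng.normal(size=(p, p)))
G = (z > 0).astype(float)

mu = G.mean(axis=0)
Sigma = np.cov(G, rowvar=False)
U = np.linalg.cholesky(Sigma).T               # Sigma = U^T U, U upper
B = np.eye(p) - np.linalg.inv(U) @ np.diag(np.diag(U))

# compare column j of B with OLS of centered gamma_j on centered gamma_{<j}
Gc = G - mu
j = 3
beta_ols = np.linalg.lstsq(Gc[:, :j], Gc[:, j], rcond=None)[0]
print(beta_ols, B[:j, j])
```

The residual covariance \(\operatorname{diag}(\mathbf{U})^2\) is diagonal, which is what makes the \(p\) regressions decouple after one factorization.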

Estimators of \(\mathbf{B}\) and \(\boldsymbol{\mu}\)

  • OLS is BLUE and consistent, but \(\mathbf{G}\) may not be full rank

  • apply Bayesian Shrinkage with “priors” on \(\boldsymbol{\mu}\) (non-informative or Normal) and \(\Sigma\) (inverse-Wishart)

  • the pseudo-posterior mean of \(\boldsymbol{\mu}\) is the current estimate of the marginal inclusion probabilities: \(\bar{\boldsymbol{\gamma}} = \hat{\boldsymbol{\mu}}\)

  • use pseudo-posterior mean for \(\boldsymbol{\Sigma}\)

  • one Cholesky decomposition provides all coefficients for the \(p\) predictions for proposing \(\boldsymbol{\gamma}^*\)

  • constrain predicted values \(\hat{\rho}_{j \mid <j} \in (\delta, 1-\delta)\)

  • generate \(\boldsymbol{\gamma}^*_j \mid \boldsymbol{\gamma}^*_{< j} \sim \textsf{Ber}(\hat{\rho}_{j \mid <j})\)

  • use as the proposal for Adaptive Independent Metropolis-Hastings or Importance Sampling (accept all) -or- Sampling Without Replacement (todo)
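One proposal draw from the steps above can be sketched as follows. The values of \(\boldsymbol{\mu}\) and \(\mathbf{B}\) here are made up stand-ins for the adaptation output; the draw also accumulates \(\log q(\boldsymbol{\gamma}^*)\), which is needed for the MH ratio or the importance weights:

```python
# Sketch of one adaptive independent proposal draw: generate
# gamma*_j ~ Ber(rho_hat_{j|<j}) sequentially, with the fitted probability
# truncated to (delta, 1 - delta). mu and B are assumed to come from the
# adaptation step (made-up values here).
import numpy as np

rng = np.random.default_rng(5)
p, delta = 5, 0.025
mu = np.array([0.7, 0.4, 0.6, 0.2, 0.5])        # current inclusion estimates
B = np.triu(rng.normal(0.0, 0.2, (p, p)), k=1)  # strictly upper triangular

def propose():
    gamma = np.zeros(p)
    log_q = 0.0
    for j in range(p):
        rho = mu[j] + (gamma[:j] - mu[:j]) @ B[:j, j]   # fitted rho_{j|<j}
        rho = np.clip(rho, delta, 1.0 - delta)          # keep q heavy-tailed
        gamma[j] = rng.uniform() < rho
        log_q += np.log(rho) if gamma[j] else np.log1p(-rho)
    return gamma, log_q

gamma_star, log_q = propose()
print(gamma_star, log_q)
```

The truncation bounds every conditional probability away from 0 and 1, so no model is unreachable under the independent proposal.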

Simulation

  • tecator data (Griffin et al. 2021)

  • a sample of \(p = 20\) variables

  • compare

    • enumeration to
    • MCMC with add, delete, and swap moves with \(q\)
    • Adaptive Independent MCMC
    • Importance Sampling with HT
  • same settings for burnin.it, MCMC.it, and thin

MSE Comparison

Continued Adaptation?

  • can update Cholesky with rank 1 updates with new models
  • how to combine IS with MH samples (weighting) ?
  • HT/Hájek: computational complexity is involved if we need to compute inclusion probabilities for all models based on updates (previous and future models)
  • Basu (1971) approach works with PPS-WOR take \(\pi_i \propto A_i \equiv q(\boldsymbol{\gamma}_i)\) (adaptation?)

Refinements

  • Want to avoid MCMC for

    • pseudo Bayesian posteriors used to learn proposal distribution in sample design for models
    • estimation of posterior model probabilities in model-based approaches (i.e. learning \(\beta\), sampling from the predictive distribution)
    • estimation of general quantities under BMA?
  • avoid infinite regret

  • more general models?

Summary

  • Adaptive Independent Metropolis proposal for models (use in more complex IS)
  • Use observed values of unique marginal likelihoods of models for estimating posterior distribution
  • Bayes estimates from MC output (a solution to O’Hagan ’87?)

Thank You!

Code is in the development version of BAS at https://github.com/merliseclyde/BAS